Abstract. We present a data structure that stores a string s[1..n] over
the alphabet [1..σ] in nH_0(s) + o(n)(H_0(s) + 1) bits, where H_0(s) is the
zero-order entropy of s. This data structure supports the operators access
and rank in time O(lg lg σ), and the select operator in constant time. This
result improves on previously known data structures using nH_0(s) +
o(n lg σ) bits, where on highly compressible instances the redundancy
o(n lg σ) can cease to be negligible compared to the nH_0(s) bits that
encode the data. The technique is based on combining previous results
through an ingenious partitioning of the alphabet, and is practical enough
to be implementable. It applies not only to strings, but also to several
other compact data structures. For example, we achieve (i) the first
representation of a string s using nH_k(s) + o(n) lg σ bits and supporting
access, rank, and select in poly-loglog time; (ii) faster search times for
the smallest existing full-text self-index; (iii) compressed permutations
π with times for π() and π^{-1}() improved to log-logarithmic; and (iv) the
first compressed representation of dynamic collections of disjoint sets.



1 Introduction 

Search operators on strings have many important applications, to the point that 
one is willing to sacrifice some additional space to index the string in order 
to support the operators in less time. The most important operators serve as 
primitives to implement many others, in particular pattern matching in
full-text databases (see, e.g., [19,6,15,20] for recent discussions): given a
string s, s.access(i) returns the ith character of s, which we denote s[i];
s.rank_a(i) returns the number of occurrences of the character a up to
position i; and s.select_a(i) returns the position of the ith occurrence of
a in s.
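
The three operators can be pinned down with a naive reference implementation. The sketch below is only illustrative: it runs in linear time, uses no compression, and the function names and the 1-based convention are assumptions chosen to match the notation above.

```python
def access(s, i):
    """Return s[i], using 1-based positions as in the text."""
    return s[i - 1]


def rank(s, a, i):
    """Return the number of occurrences of character a in s[1..i]."""
    return s[:i].count(a)


def select(s, a, i):
    """Return the 1-based position of the ith occurrence of a in s,
    or None if a occurs fewer than i times."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == a:
            seen += 1
            if seen == i:
                return pos
    return None
```

Such a reference version is mainly useful as a correctness oracle when testing a compressed structure.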

Initial results [12] provided data structures using n lg σ + o(n lg σ) bits,
where n and σ are the sizes of the text and its alphabet, respectively, and lg
denotes the logarithm in base two. The indexing space in o(n lg σ) is
considered asymptotically "negligible" compared to the space required to hold
the main data, while providing support for the operators in time O(lg σ) (and
O(lg lg σ) in later results [10]), a clear improvement on naive data
structures.
Table 1. Recent bounds and our new ones for data structures supporting access, rank
and select. The space bounds in the fourth and last rows hold only when k = o(log_σ n).
In the first row the space formula holds only for σ = o(n), and is further reduced to
nH_0(s) + o(n) in the special case where σ = lg^{O(1)} n. In the times in the fifth and
sixth rows, σ can be changed to min(σ, n/occ(a, s)), where a is the character involved.
Regularities in the string permit further reductions in the space, from n lg σ
bits down to nH_k(s) bits, where H_k(s) denotes the kth-order empirical entropy
of s (i.e., the minimum self-information of s with respect to a kth-order Markov
source; see Manzini [16] for a definition and discussion). The challenge of
compressing the string while still supporting the operators efficiently was
solved through various alternative data structures, using as little as
nH_k(s) + o(n lg σ) bits while still supporting the operators in the same
times [12,3].

One problem with such space is that, on highly compressible practical data,
the o(n lg σ) bits of the index are no longer negligible compared to the space
used to encode the compressed data. Hence the challenge is to retain the efficient
support for the operators while compressing the index redundancy as well. In
this paper we solve this challenge in the case of zero-order entropy compression,
that is, the redundancy of our data structure is proportional to the zero-order
entropy of the text, plus o(n) bits. Moreover, our data structure not only still
supports the operators in good time, but in some cases it supports them in better
time than less compressed ones (e.g., in O(lg lg min(σ, n/occ(a, s))) time, where
a is the character concerned): see Table 1 for a systematic comparison.

For example, the representation by Golynski et al. [10] does not compress s¹
and uses additional O(n lg σ / lg lg σ) = n·o(lg σ) bits, but offers
log-logarithmic times for the operations (see rows 2 and 3 of Table 1);
Ferragina et al. [7] achieve zero-order compression plus
O(n lg σ lg lg n / lg n) = o(n) lg σ bits, supporting the operations in
O(1 + lg σ / lg lg n) time (see row 1 of Table 1).

In Section 2 we show how to combine the strengths of these data structures 
[10, 7], obtaining the best from both plus compressed redundancy. The technique 
can be summarized as partitioning the alphabet into sub-alphabets according 



1 In terms of the usual entropy measures. It compresses to the kth-order entropy of
a different sequence (A. Golynski, personal communication).
to the characters' frequencies in s, storing in a multiary wavelet tree [7] the
string that results from replacing the characters in s by identifiers of their
sub-alphabets, and storing separate strings, each the projection of s to the
characters of s belonging to each sub-alphabet, this time using Golynski et
al.'s [10] structure for large alphabets. We achieve a data structure that
stores a string s[1..n] in nH_0(s) + o(n)(H_0(s) + 1) bits, thus guaranteeing
that the redundancy stays negligible even when the text is very compressible.
It supports queries in the times shown in Table 1 (rows 5 and 6 give two
alternatives).

Then we consider various extensions and applications of our main result. 
In Section 3 we show how to achieve compression in terms of the kth-order
empirical entropy H_k(s) of s, by giving up or slowing down rank and select
queries (thus giving a simpler construction to achieve row 4 in Table 1, and with
lower space redundancy on small alphabets). We also show how our result can
be used to improve an existing text index that achieves kth-order entropy [7]. In
Sections 4 and 5, respectively, we show how to apply our data structure to store 
a compressed permutation, a compressed function and a compressed dynamic 
collection of disjoint sets, while supporting a rich set of operations on those. 
This improves or gives alternatives to the best previous results [4,18,13]. We 
have approached these applications in such a way that an improvement to our 
main result, however achieved, translates into improved bounds for them as well. 

2 Alphabet partitioning 

Let s be a sequence over effective alphabet [1..σ], that is, every character
appears in s, thus σ ≤ n (at the end of the section we handle alphabets that
are not effective). The zero-order entropy of s is
H_0(s) = Σ_{a ∈ [1..σ]} (occ(a, s)/n) lg(n/occ(a, s)), where occ(a, s) is the
number of occurrences of the character a in s. Note that by convexity we have
nH_0(s) ≥ (σ − 1) lg n + (n − σ + 1) lg(n/(n − σ + 1)), a property we will use
later.
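
As a quick numerical illustration of the definition and of the convexity bound, the following sketch computes H_0(s) directly from character counts. The function names are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def zero_order_entropy(s):
    """H_0(s) = sum over characters a of (occ(a,s)/n) * lg(n/occ(a,s))."""
    n = len(s)
    return sum(c / n * log2(n / c) for c in Counter(s).values())


def convexity_lower_bound(n, sigma):
    """(sigma - 1) lg n + (n - sigma + 1) lg(n / (n - sigma + 1)),
    the lower bound on n*H_0(s) for an effective alphabet of size sigma."""
    return (sigma - 1) * log2(n) + (n - sigma + 1) * log2(n / (n - sigma + 1))
```

The bound is tight when σ − 1 characters occur exactly once and the remaining character fills the rest of the string.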

Our results are based on the following alphabet partitioning scheme. Let t
be the string in which each character s[i] is replaced by the number

\[
  t[i] = \lceil \lg(n/\mathrm{occ}(s[i], s)) \, \lg n \rceil \le \lceil \lg^2 n \rceil .
\]

For 0 ≤ ℓ ≤ ⌈lg² n⌉, let s_ℓ be the string consisting of those characters s[i]
that were replaced by ℓ in t, and let σ_ℓ be the number of distinct characters
in s_ℓ. More precisely, s_ℓ will be a sequence over alphabet [1..σ_ℓ]. The
mapping with [1..σ] will be done by storing another sequence m[1..σ] so that
m[a] will be the sub-alphabet number assigned to character a. Hence character a
is mapped to character m.rank_{m[a]}(a) in string s_{m[a]}, and symbol
x ∈ [1..σ_ℓ] in s_ℓ corresponds to character m.select_ℓ(x).

Notice that, if both a and b are replaced by the same number in t, then
lg(n/occ(b, s)) − lg(n/occ(a, s)) < 1/lg n and so
occ(a, s)/occ(b, s) < 2^{1/lg n}. It follows that, if a is replaced by ℓ in t,
then σ_ℓ < 2^{1/lg n} |s_ℓ|/occ(a, s) (by fixing a and summing over all those b
replaced by ℓ). Since

\[
  \sum_{a} \mathrm{occ}(a, s) = n
  \quad\text{and}\quad
  \sum_{\lceil \lg(n/\mathrm{occ}(a,s)) \lg n \rceil = \ell} \mathrm{occ}(a, s) = |s_\ell| ,
\]

we have

\[
\begin{aligned}
  nH_0(t) + \sum_{\ell} |s_\ell| \lg \sigma_\ell
  &< \sum_{\ell} |s_\ell| \lg\frac{n}{|s_\ell|}
     + \sum_{\ell} \sum_{\lceil \lg(n/\mathrm{occ}(a,s)) \lg n \rceil = \ell}
       \mathrm{occ}(a, s) \lg\!\left( 2^{1/\lg n} \frac{|s_\ell|}{\mathrm{occ}(a, s)} \right) \\
  &= \sum_{a} \mathrm{occ}(a, s) \lg\frac{n}{\mathrm{occ}(a, s)} + \frac{n}{\lg n} \\
  &= nH_0(s) + o(n) .
\end{aligned}
\]

In other words, if we represent t with H_0(t) bits per symbol and each s_ℓ
with lg σ_ℓ bits per symbol, then we achieve a good overall compression. Thus
we can obtain a very compact representation of a string s by storing a compact
representation of t and storing each s_ℓ as an "uncompressed" string over an
alphabet of size σ_ℓ.

Now we show how our approach can be used to obtain a fast and compact
rank/select data structure. Suppose we have a data structure T that supports
access, rank and select queries on t; another structure M that supports the same
queries on m; and data structures S_ℓ, one for each ℓ, that support the same
queries on the corresponding strings s_ℓ. With these data structures we can
implement

s.access(i) = m.select_ℓ(s_ℓ.access(t.rank_ℓ(i))), where ℓ = t.access(i);
s.rank_a(i) = s_ℓ.rank_c(t.rank_ℓ(i)), where ℓ = m.access(a) and c = m.rank_ℓ(a);
s.select_a(i) = t.select_ℓ(s_ℓ.select_c(i)), where ℓ = m.access(a) and c = m.rank_ℓ(a).
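
The three identities above can be checked with a small executable sketch. It assumes s is given as a list of integers over an effective alphabet [1..σ], and it replaces the wavelet trees and Golynski et al.'s structure with naive linear-time scans, so it illustrates only the decomposition into t, m and the strings s_ℓ, not the space or time bounds. All class and function names are hypothetical.

```python
from collections import Counter
from math import ceil, log2


def rank(seq, a, i):
    """Occurrences of a in seq[1..i] (1-based, inclusive)."""
    return sum(1 for x in seq[:i] if x == a)


def select(seq, a, k):
    """1-based position of the kth occurrence of a in seq."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == a:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


class PartitionedString:
    def __init__(self, s):
        n, occ = len(s), Counter(s)
        sigma = len(occ)
        # m[a-1] is the sub-alphabet number of character a.
        self.m = [ceil(log2(n / occ[a]) * log2(n)) for a in range(1, sigma + 1)]
        # t replaces every character by its sub-alphabet number.
        self.t = [self.m[a - 1] for a in s]
        # sub[l] is s_l, written over the local alphabet [1..sigma_l].
        self.sub = {}
        for a in s:
            l = self.m[a - 1]
            self.sub.setdefault(l, []).append(rank(self.m, l, a))

    def access(self, i):
        l = self.t[i - 1]
        c = self.sub[l][rank(self.t, l, i) - 1]
        return select(self.m, l, c)

    def rank(self, a, i):
        l = self.m[a - 1]
        c = rank(self.m, l, a)
        return rank(self.sub[l], c, rank(self.t, l, i))

    def select(self, a, i):
        l = self.m[a - 1]
        c = rank(self.m, l, a)
        return select(self.t, l, select(self.sub[l], c, i))
```

Because m and t are derived from the same floating-point expression, any rounding in the ceiling affects both consistently, so the queries stay correct even if a bucket boundary is shifted.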

We implement T and M as multiary wavelet trees [7]; we implement each
S_ℓ as either a multiary wavelet tree or an instance of Golynski et al.'s [10]
access/rank/select data structure, depending on whether σ_ℓ ≤ lg n or not. The
wavelet tree for T uses at most nH_0(t) + o(n) bits and operates in constant time,
because its alphabet size is polylogarithmic. If S_ℓ is implemented as a wavelet
tree, it uses at most |s_ℓ| H_0(s_ℓ) + o(|s_ℓ|) bits and operates in constant
time for the same reason; otherwise it uses at most
|s_ℓ| lg σ_ℓ + O(|s_ℓ| lg σ_ℓ / lg lg σ_ℓ) ≤ |s_ℓ| lg σ_ℓ + O(|s_ℓ| lg σ_ℓ / lg lg lg n)
bits (the latter because σ_ℓ > lg n). Thus in all cases the space for S_ℓ is
bounded by |s_ℓ| lg σ_ℓ + o(|s_ℓ| lg σ_ℓ) bits.² Finally, since M is a sequence
of length σ over an alphabet of size ⌈lg² n⌉, the wavelet tree for M takes at
most O(σ lg lg n) bits. Because of the property we referred to in the beginning
of this section, nH_0(s) ≥ (σ − 1) lg n, this space is also o(nH_0(s)). By these
calculations, the space for T, M and the S_ℓ's adds up to
nH_0(s) + o(nH_0(s)) + o(n).

2 We note that this o(·) expression is in all cases asymptotic in n: in the case of
multiary wavelet trees, it is achieved by using block sizes whose length is defined
in terms of lg n and not lg |s_ℓ|, at the price of storing universal tables of size
O(√n lg^{O(1)} n).
Depending on which time tradeoff we use for Golynski et al.'s data structure,
we obtain the results of Table 1. We can refine the time complexity by
noticing that the only non-constant times are due to operating on some string
s_ℓ, where the alphabet is of size σ_ℓ < 2^{1/lg n} |s_ℓ|/occ(a, s), where a is
the character in question. Thus the actual times are, for example,
O(lg lg σ_ℓ) = O(lg lg min(σ, n/occ(a, s))).

Theorem 1. We can store s[1..n] over effective alphabet [1..σ] in nH_0(s) +
o(n)(H_0(s) + 1) bits and support access, rank and select queries in O(lg lg σ),
O(lg lg σ), and O(1) time, respectively (variant (i)). Alternatively, we can
support access, rank and select queries in O(1), O(lg lg σ lg lg lg σ) and
O(lg lg σ) time, respectively (variant (ii)). All the σ terms in the time
complexities are actually min(σ, n/occ(a, s)), where a = s[i] for access, and a
is the character argument for rank and select.

In the most general case, s is a sequence over an alphabet Σ which is not
an effective alphabet, and σ symbols from Σ occur in s. Let Σ' be the set of
elements that occur in s; we can map characters from Σ' to elements of [1..σ]
by replacing each a ∈ Σ' with its rank in Σ'. All elements of Σ' are stored in
the indexed dictionary data structure described by Raman et al. [21], so that
the following queries are supported in constant time: for any a ∈ Σ' its rank
in Σ' can be found (for any a ∉ Σ' the answer is −1); for any i ∈ [1..σ] the
ith smallest element in Σ' can be found. The indexed dictionary of Raman et
al. [21] uses σ lg(em/σ) + o(σ) + O(lg lg m) bits of space, where e is the base
of the natural logarithm and m is the maximal element in Σ'; the value of m can
be specified with additional O(lg m) bits. We replace every element in s by its
rank in Σ'; the resulting string can be stored using Theorem 1. Hence, in the
general case the space usage is increased by σ lg(em/σ) + o(σ) + O(lg m) bits
and the asymptotic time complexity of queries remains unchanged.
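
The mapping from Σ' to [1..σ] can be sketched as follows. This is not the indexed dictionary of Raman et al. [21]: a sorted list with binary search stands in for it, so the queries take O(lg σ) time rather than constant time and the space is uncompressed. The class and function names are assumptions for illustration.

```python
from bisect import bisect_left


class IndexedDictionary:
    """Sorted-list stand-in for an indexed dictionary over the set Σ'."""

    def __init__(self, symbols):
        self.sorted = sorted(set(symbols))

    def rank(self, a):
        """1-based rank of a in Σ', or -1 if a does not occur."""
        k = bisect_left(self.sorted, a)
        if k < len(self.sorted) and self.sorted[k] == a:
            return k + 1
        return -1

    def select(self, i):
        """The ith smallest element of Σ' (1-based)."""
        return self.sorted[i - 1]


def to_effective(s):
    """Return the dictionary and s rewritten over the effective alphabet."""
    d = IndexedDictionary(s)
    return d, [d.rank(a) for a in s]
```

The rewritten string is the one that would be handed to the structure of Theorem 1; select on the dictionary maps answers back to the original characters.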

3 Higher-order entropies 

Sadakane and Grossi [23] showed how to store s in nH_k(s) + o(n) lg σ bits, for
k = o(log_σ n), and still support O(1)-time access (and, moreover, retrieving
any O(log_σ n) contiguous symbols in O(1) time). Their scheme was simplified
by González and Navarro [11] and then further simplified by Ferragina and
Venturini [8]. The main idea is to break s into n' blocks of length
b = (log_σ n)/2 and store the sequence s' of blocks in n'H_0(s') + O(n') bits
while still supporting O(1)-time access to the blocks; with
O(σ^b b lg σ) = o(n) more bits, we can support O(1)-time access to the
individual characters. Ferragina and Venturini showed that
n'H_0(s') ≤ nH_k(s) + O((n/b) k lg σ) for k < b, so for k = o(log_σ n)
we use a total of at most nH_k(s) + o(n) lg σ bits. Notice that, with Theorem 1,
we need only n'H_0(s') + o(n')(H_0(s') + 1) bits to store s' (although sadly this
does not improve the final bound).

3 In this case we do not need [1..σ] to be the effective alphabet: in order to achieve any
k > 0, we need (at the very least) that σ = o(n), and thus the extra space for our
Barbay, He, Munro and Rao [3] showed how, given a data structure that
supports only access queries, it is possible to build a succinct index of size
n·o(lg σ) bits that supports rank and select. For example, building a succinct
index on top of Ferragina and Venturini's data structure, whose size is bounded
in terms of H_k(s), gives them the bounds shown in the fourth row of Table 1. We
now show how to reduce the size of the succinct index to o(n) lg σ bits, giving
us the bounds in the last row of the table.

Theorem 2. We can store s[1..n] in nH_k(s) + o(n) lg σ bits, for k = o(log_σ n),
and support access, rank and select queries in O(1), O((lg lg σ)² lg lg lg σ), and
O(lg lg σ lg lg lg σ) time, respectively.

Proof. We use Ferragina and Venturini's data structure [8] to store the sequence
s' of blocks in nH_k(s) + o(n) lg σ bits such that we can access any block or
character in O(1) time. We then store the distinct blocks that occur in s' in
a data structure B such that we can perform rank and select queries on the
contents of any block in O(lg lg σ) time; since there are σ^b = O(√n) possible
distinct blocks, this takes O(√n (b lg σ)) = o(n) bits.

We build a binary relation R in [1..σ] × [1..n'] so that R(a, j) if and only
if character a appears in block s'[j]. Rather than storing R explicitly, we store
only its succinct index [3, Theorem 3.2], which uses Ferragina and Venturini's
data structure and takes O(p lg σ / lg lg σ) more bits, where p ≤ min(n, n'σ)
is the number of pairs in R. If σ = O(1) then O(p lg σ / lg lg σ) = O(n') =
o(n); otherwise, O(p lg σ / lg lg σ) = O(n lg σ / lg lg lg n) = o(n) lg σ. This
succinct index allows us to find the number of blocks in s'[1..j] containing a's
in O((lg lg σ)² lg lg lg σ) time, and to find the rth block containing a's in
time O(lg lg σ lg lg lg σ).

For each of the p pairs (a, j) in R, we store the number of times a appears in
s'[j]. Assume we store these numbers in unary, each number x as the x + 1 bits
1^x 0, and concatenate them all, row-wise, into a bitmap P with n 1s and p 0s.
Calculations similar to those above show we can store P in o(n) lg σ bits and
support rank and select queries in O(1) time [21]. With P we can compute, in
O(1) time, the total number of a's across the first r blocks containing a's, and
the smallest r such that the first r blocks containing a's contain at least j of
them.
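
The two queries on P can be illustrated for a single row, that is, for one character a. The sketch below assumes the row is stored on its own (the paper concatenates all rows, which would add an offset) and uses naive linear-time rank and select on bits; the function names are hypothetical.

```python
def unary_bitmap(counts):
    """Encode each count x as x ones followed by a zero: 1^x 0."""
    bits = []
    for x in counts:
        bits.extend([1] * x)
        bits.append(0)
    return bits


def rank_bit(bits, b, i):
    """Occurrences of bit b in bits[1..i] (1-based, inclusive)."""
    return sum(1 for v in bits[:i] if v == b)


def select_bit(bits, b, k):
    """1-based position of the kth occurrence of bit b."""
    seen = 0
    for pos, v in enumerate(bits, start=1):
        if v == b:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def total_in_first_blocks(bits, r):
    """Total number of a's in the first r blocks containing a's."""
    if r == 0:
        return 0
    # The rth zero closes the rth block; everything before it,
    # except the r zeros themselves, is a one.
    return select_bit(bits, 0, r) - r


def first_block_reaching(bits, j):
    """Smallest r such that the first r blocks hold at least j a's."""
    # The jth one lies in the block that follows the zeros seen so far.
    return rank_bit(bits, 0, select_bit(bits, 1, j)) + 1
```

For counts (2, 1, 3) the bitmap is 110100 1110 without the space, and both queries reduce to one select followed by simple arithmetic or one rank.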

To perform s.rank_a(j), we compute the index j' = ⌈j/b⌉ in s' of the block
containing the jth character of s; use Ferragina and Venturini's data structure
and B to find the number of a's within s'[j'] up to position j mod b; use R to
find the number of blocks in s'[1..j' − 1] that contain a's; and use P to find
the total number of a's in those preceding blocks. To perform s.select_a(j), we
use P to find the smallest r such that the first r blocks containing a's contain
at least j of them; use P again to find the total number t of a's in the first
r − 1 blocks containing a's; use R to find the rth block s'[j'] containing a's; and
use Ferragina and Venturini's data structure and B to find the position of the
(j − t)th a in s'[j']. □



structure M is O(σ lg lg n) = o(σ lg n) = o(n lg σ), already considered in the space
formula.
Alternatively, we can achieve kth-order entropy by partitioning the
Burrows-Wheeler transform of the sequence and encoding each partition to
zero-order entropy [7]. This representation, however, is more useful for
self-indexing than for supporting access, rank, and select on the sequence.
Self-indexes also represent a sequence, but they support other operations
related to text searching. By using Theorem 1(i) to represent the partitions of
Ferragina et al. [7], on which we need to carry out access and rank, we achieve
the following result, improving previous ones [7, 10].

Theorem 3. Let t[1..n] be a text string over alphabet [1..σ], such that σ = o(n).⁴
Then we can represent t using nH_k(t) + o(n) lg σ bits of space, for any
k ≤ α log_σ n and constant 0 < α < 1. The following operations can be carried
out: (i) count the number of occurrences of a pattern p[1..m] in t, in time
O(m lg lg σ); (ii) locate any such occurrence in time
O(log_σ n lg lg n lg lg σ); (iii) extract t[l, r] in time
O((r − l + log_σ n lg lg n) lg lg σ).

For this particular locating time we are sampling one out of log_σ n lg lg n text
positions.

4 Compressing permutations 

We now show how to use access/rank/select data structures to store a
compressed permutation. We follow Barbay and Navarro's notation [4] and improve
their space and, especially, their time performance. They measure the
compressibility of a permutation π in terms of the entropy of the distribution
of the lengths of runs of different kinds. Let π be covered by p runs (using any
of the previous definitions of runs [14,4,17]) of lengths
runs(π) = (n_1, ..., n_p). Then H(runs(π)) = Σ_i (n_i/n) lg(n/n_i) ≤ lg p is
called the entropy of the runs (and, because n_i ≥ 1, it also holds
nH(runs(π)) ≥ (p − 1) lg n). We first consider permutations which are
interleaved sequences of increasing or decreasing values, as first defined
by Levcopoulos et al. [14] for adaptive sorting and later used for compression
[4], and then give improved results for more specific classes of runs. In both
cases we consider first the application of the permutation π() and its inverse,
π^{-1}(), to show later how to extend the support to the iterated applications
of the permutation, π^k(), extending and improving the previous results of
Munro et al. [18].

Theorem 4. Let π be a permutation on n elements that consists of p interleaved
increasing or decreasing runs. We can store π in
2nH(runs(π)) + o(n)(H(runs(π)) + 1) bits and perform π() and π^{-1}() queries in
O(lg lg p) time.

Proof. We first replace all the elements of the rth run by r, for 1 ≤ r ≤ p. Let s
be the resulting string and let s' be s permuted according to π. We store s and
s' using Theorem 1 and store p bits indicating whether each run is increasing



4 Again, [1..σ] does not need to be the effective alphabet.
or decreasing. Notice s'[π(i)] = s[i] and, if π(i) is part of an increasing run, then
s'.rank_{s[i]}(π(i)) = s.rank_{s[i]}(i), so

π(i) = s'.select_{s[i]}(s.rank_{s[i]}(i));

if π(i) is part of a decreasing run, then
s'.rank_{s[i]}(π(i)) = s.rank_{s[i]}(n) + 1 − s.rank_{s[i]}(i), so

π(i) = s'.select_{s[i]}(s.rank_{s[i]}(n) + 1 − s.rank_{s[i]}(i)).

A π^{-1}() query is symmetric. The space of the bitmap is p = o(nH(runs(π)))
because nH(runs(π)) ≥ (p − 1) lg n. □
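
The construction in this proof can be exercised with a short sketch. It assumes the decomposition into runs is supplied by the caller as lists of positions, and it uses naive rank and select on plain lists instead of the structure of Theorem 1, so only the identities for π() and π^{-1}() are illustrated. All names are hypothetical.

```python
def rank(seq, a, i):
    """Occurrences of a in seq[1..i] (1-based, inclusive)."""
    return sum(1 for x in seq[:i] if x == a)


def select(seq, a, k):
    """1-based position of the kth occurrence of a in seq."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == a:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


class RunPermutation:
    def __init__(self, pi, runs):
        # pi[i-1] = π(i); runs is a list of lists of 1-based positions,
        # each sorted by position, along which π is monotone.
        self.n = len(pi)
        self.s = [0] * self.n
        self.decreasing = {}
        for r, positions in enumerate(runs, start=1):
            values = [pi[i - 1] for i in positions]
            self.decreasing[r] = len(values) > 1 and values[0] > values[1]
            for i in positions:
                self.s[i - 1] = r
        # s' is s permuted according to π: s'[π(i)] = s[i].
        self.s_prime = [0] * self.n
        for i in range(1, self.n + 1):
            self.s_prime[pi[i - 1] - 1] = self.s[i - 1]

    def apply(self, i):
        r = self.s[i - 1]
        k = rank(self.s, r, i)
        if self.decreasing[r]:
            k = rank(self.s, r, self.n) + 1 - k
        return select(self.s_prime, r, k)

    def inverse(self, v):
        r = self.s_prime[v - 1]
        k = rank(self.s_prime, r, v)
        if self.decreasing[r]:
            k = rank(self.s_prime, r, self.n) + 1 - k
        return select(self.s, r, k)
```

Note that π itself is discarded after construction: both queries are answered from s, s' and one direction bit per run.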

We now consider the case of runs restricted to be strictly incrementing (+1)
or decrementing (−1), while still letting them be interleaved: such runs were not
directly considered before.

Theorem 5. Let π be a permutation on n elements that consists of p interleaved
incrementing or decrementing runs. For any constant ε > 0, we can store π in
nH(runs(π)) + o(n)(H(runs(π)) + 1) + O(p n^ε) bits and perform π() queries in
O(lg lg p) time and π^{-1}() queries in O(1/ε) time.

Proof. Due to space constraints, we leave the proof of this theorem to the
appendix. □

Notice that, if π consists of p contiguous increasing or decreasing runs, then
π^{-1} consists of p interleaved incrementing or decrementing runs. Therefore,
Theorem 5 applies to such permutations as well, with the time bounds for π() and
π^{-1}() queries reversed.

Theorem 6. Let π be a permutation on n elements that consists of p contiguous
increasing or decreasing runs. For any constant ε > 0, we can store π in
nH(runs(π)) + o(n)(H(runs(π)) + 1) + O(p n^ε) bits and perform π() queries in
O(1/ε) time and π^{-1}() queries in O(lg lg p) time.

If π's runs are both contiguous and incrementing or decrementing, then so
are the runs of π^{-1}. In this case we can store π in O(p n^ε) bits and answer
π() and π^{-1}() queries in O(1) time. To do this, we use two predecessor data
structures: for each run, in one of the data structures we store the position j
in π of the first element of the run, with π(j) as auxiliary information; in the
other, we store π(j), with j as auxiliary information. To perform a query π(i),
we use the first predecessor data structure to find the starting position j of the
run containing i, and return π(j) + i − j. A π^{-1}() query is symmetric. Decreasing
runs are handled as before.
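
The two-predecessor scheme can be sketched as below. Sorted lists with binary search stand in for the predecessor structures, so queries take O(lg p) time here rather than the bound stated in the text, and the second structure is keyed by each run's minimum value so that decrementing runs are handled uniformly. These are assumptions of the sketch, as are all names.

```python
from bisect import bisect_right


class ContiguousRuns:
    def __init__(self, pi):
        # pi[i-1] = π(i). Split π into maximal contiguous runs of step +1 or -1.
        runs = []  # (start position, first value, step, length)
        n, i = len(pi), 0
        while i < n:
            j, step = i, 0
            if i + 1 < n and pi[i + 1] - pi[i] in (1, -1):
                step = pi[i + 1] - pi[i]
                j = i + 1
                while j + 1 < n and pi[j + 1] - pi[j] == step:
                    j += 1
            runs.append((i + 1, pi[i], step, j - i + 1))
            i = j + 1
        # First structure: runs keyed by starting position.
        self.by_pos = runs
        self.starts = [r[0] for r in runs]
        # Second structure: runs keyed by their minimum value.
        self.by_min = sorted(
            (min(v, v + st * (ln - 1)), j, st, ln) for j, v, st, ln in runs
        )
        self.mins = [r[0] for r in self.by_min]

    def apply(self, i):
        j, v, step, _ = self.by_pos[bisect_right(self.starts, i) - 1]
        return v + step * (i - j)

    def inverse(self, v):
        lo, j, step, ln = self.by_min[bisect_right(self.mins, v) - 1]
        off = v - lo
        return j + (off if step >= 0 else ln - 1 - off)
```

Each run costs a constant number of integers, which mirrors the fact that the space depends on p rather than on n.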

Theorem 7. Let π be a permutation on n elements that consists of p contiguous
incrementing or decrementing runs. For any constant ε > 0, we can store π in
O(p n^ε) bits and perform π() and π^{-1}() queries in O(1/ε) time.
We now show how to achieve exponentiation (π^k(i), π^{-k}(i)) within
compressed space. Munro et al. [18] reduced the problem of supporting
exponentiation on a permutation π to the support of the direct and inverse
application of another permutation, related to π but with quite distinct runs.
Combining any of our results (Theorems 4 to 7) with this technique does yield a
compression, but without any relation to the runs of π. The following
construction, extending the technique from Munro et al. [18], retains the
compressibility properties of π by building a companion data structure that
uses small space to support the exponentiation, thus allowing the compression
of the main data structure with any of Theorems 4 to 7.

Theorem 8. Suppose we have a data structure D that stores a permutation π
on n elements and supports queries π() in time g(π). Then for any t ≤ n, we can
build a succinct index that takes O((n/t) lg n) bits and, when used in conjunction
with D, supports π^k() and π^{-k}() queries in O(t g(π)) time.

Proof. We decompose π into its cycles and, for every cycle of length at least t,
store the cycle's length and an array containing pointers to every tth element
in the cycle, which we call 'marked'. We also store a compressed binary string,
aligned to π, indicating the marked elements. For each marked element, we record
to which cycle it belongs and its position in the array of that cycle.

To compute π^k(i), we repeatedly apply π() at most t times until we either
loop (in which case we need apply π() at most t more times to find π^k(i) in the
loop) or we find a marked element. Once we have reached a marked element,
we use its array position and cycle length to find the pointer to the last marked
element in the cycle before π^k(i), and the number of applications of π() needed
to map that to π^k(i) (at most t). A π^{-k}() query is similar (note that it does
not need to use π^{-1}()). □
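
The marking scheme of this proof can be sketched as follows. The sketch keeps π as a plain list and the marks in a dictionary instead of a compressed bitmap, so it illustrates the query logic only; the class name and layout are assumptions. Negative exponents are handled by the same code, since an offset inside a cycle can be reduced modulo the cycle length without ever applying π^{-1}().

```python
class PowerIndex:
    def __init__(self, pi, t):
        self.pi, self.t = pi, t   # pi[i-1] = π(i)
        self.mark = {}            # marked element -> (cycle id, index in its array)
        self.cycles = []          # per long cycle: (length, marked elements)
        seen = [False] * len(pi)
        for start in range(1, len(pi) + 1):
            if seen[start - 1]:
                continue
            cycle, x = [], start
            while not seen[x - 1]:
                seen[x - 1] = True
                cycle.append(x)
                x = pi[x - 1]
            if len(cycle) >= t:
                marked = cycle[::t]          # every tth element of the cycle
                for q, y in enumerate(marked):
                    self.mark[y] = (len(self.cycles), q)
                self.cycles.append((len(cycle), marked))

    def power(self, i, k):
        """Return π^k(i); k may be negative."""
        x, d = i, 0
        while x not in self.mark:
            if d == k:
                return x
            x = self.pi[x - 1]
            d += 1
            if x == i:                       # looped: short cycle of length d
                for _ in range(k % d):
                    x = self.pi[x - 1]
                return x
        cid, q = self.mark[x]
        length, marked = self.cycles[cid]
        # x sits at cycle offset q*t; the target is (k - d) steps further on.
        offset = (q * self.t + (k - d)) % length
        x = marked[offset // self.t]         # last marked element before the target
        for _ in range(offset % self.t):
            x = self.pi[x - 1]
        return x
```

Larger t means fewer marked elements and a smaller index, at the cost of up to about 2t applications of π() per query.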

As an example, given a constant ε > 0 and a value t ≤ n, we can combine
Theorems 6 and 8 to obtain a data structure that stores Sadakane's Ψ
function [22] for s in nH_0(s) + o(n)(H_0(s) + 1) + O(σ n^ε + (n/t) lg n) bits and
supports π^k() and π^{-k}() queries in O(1/ε + t) time; these queries are useful when
working on compressed suffix arrays and trees.

5 Compressing functions and dynamic collections of 
disjoint sets 

Hreinsson, Krøyer and Pagh [13] recently showed how, given
X = {x_1, ..., x_n} ⊆ [U] and f : [U] → [1..σ], where [U] is the set of numbers
whose binary representations fit in a machine word, they can store f restricted
to X in compressed form with constant-time evaluation. Their representation
occupies at most (1 + δ)nH_0(f) + n min(p_max + 0.086, 1.82(1 − p_max)) + o(σ)
bits, where δ > 0 is a given constant and p_max is the relative frequency of the
most common function value. We note that this bound holds even when σ ≫ n.
Notice that, in the special case where X = [1..n] and σ ≤ n, we can achieve
constant-time evaluation and a better space bound using either Theorem 1 or
Theorem 2. With Theorem 1 we can also find all the elements in [1..n] that f
maps to a given element in [1..σ] (using select), find an element's rank among
the elements with the same image, or the size of the preimage (using rank), etc.

Theorem 9. Let f : [1..n] → [1..σ] be a surjective function. We can represent
f using nH_0(f) + o(n)(H_0(f) + 1) bits so that any f(i) can be computed in O(1)
time. Moreover, each element of f^{-1}(a) can be computed in O(lg lg σ) time, and
|f^{-1}(a)| requires time O(lg lg σ lg lg lg σ). Alternatively we can compute f(i)
and |f^{-1}(a)| in time O(lg lg σ) and deliver any element of f^{-1}(a) in O(1) time.

We can also achieve interesting results with our theorems from Section 4,
as runs arise naturally in many real-life functions. For example, suppose we
decompose f(1), ..., f(n) into p interleaved non-increasing or non-decreasing
runs. Then we can store it as a combination of the permutation π that stably
sorts the values f(i), plus a compressed rank/select data structure storing a
binary string b[1..n + σ + 1] with σ + 1 bits set to 1: if f maps i values in
[1..n] to a value j in [1..σ] then, in b, there are i bits set to 0 between the
jth and (j + 1)th bits set to 1. Therefore,

f(i) = b.rank_1(b.select_0(π(i)))

and the theorem below follows immediately from Theorem 4. Similarly, f^{-1}(a) is
obtained by applying π^{-1}() to the area
b.rank_0(b.select_1(a)) + 1 ... b.rank_0(b.select_1(a + 1)), and |f^{-1}(a)| is
computed in O(1) time. Notice H(runs(π)) = H(runs(f)) ≤ H_0(f), and that b can be
stored in O(σ lg(n/σ)) + o(n) bits [21].
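
The reduction from a function to a permutation plus a bitmap can be checked with the sketch below. It stores π and b as plain lists, builds π^{-1} on the fly with a dictionary, and uses naive rank and select, so none of the compression is modelled; the function names are assumptions.

```python
def rank_bit(bits, b, i):
    """Occurrences of bit b in bits[1..i] (1-based, inclusive)."""
    return sum(1 for v in bits[:i] if v == b)


def select_bit(bits, b, k):
    """1-based position of the kth occurrence of bit b."""
    seen = 0
    for pos, v in enumerate(bits, start=1):
        if v == b:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def build(f, sigma):
    """f[i-1] = f(i) with values in [1..sigma]. Return (pi, b)."""
    n = len(f)
    # sorted() is stable, so ties keep their original order.
    order = sorted(range(1, n + 1), key=lambda i: f[i - 1])
    pi = [0] * n
    for pos, i in enumerate(order, start=1):
        pi[i - 1] = pos
    b = []
    for j in range(1, sigma + 1):
        b.append(1)
        b.extend([0] * f.count(j))   # one 0 per element mapped to j
    b.append(1)
    return pi, b


def evaluate(pi, b, i):
    """f(i) = b.rank_1(b.select_0(π(i)))."""
    return rank_bit(b, 1, select_bit(b, 0, pi[i - 1]))


def preimage(pi, b, a):
    """All i with f(i) = a, via π^{-1} on the range of sorted ranks for a."""
    inv = {v: i for i, v in enumerate(pi, start=1)}
    lo = rank_bit(b, 0, select_bit(b, 1, a)) + 1
    hi = rank_bit(b, 0, select_bit(b, 1, a + 1))
    return [inv[v] for v in range(lo, hi + 1)]
```

The size of the preimage is simply hi − lo + 1, which is why it needs only operations on b.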

Theorem 10. Let f : [1..n] → [1..σ] be a surjective function⁵ with f(1), ..., f(n)
consisting of p interleaved non-increasing or non-decreasing runs. Then we can
store f in 2nH(runs(f)) + o(n)(H(runs(f)) + 1) + O(σ lg(n/σ)) bits and compute
any f(i), as well as retrieve any element in f^{-1}(a), in O(lg lg p) time. The size
|f^{-1}(a)| can be computed in O(1) time.

We can obtain a more competitive result if f is split into contiguous runs,
but their entropy is no longer bounded by the zero-order entropy of string f.

Theorem 11. Let f : [1..n] → [1..σ] be a surjective function with f(1), ..., f(n)
consisting of p contiguous non-increasing or non-decreasing runs. Then we can
represent f in nH(runs(f)) + o(n)(H(runs(f)) + 1) + O(p n^ε) + O(σ lg(n/σ)) bits,
for any constant ε > 0, and compute any f(i) in O(lg lg σ) time, as well as
retrieve any element in f^{-1}(a) in O(1/ε) time. The size |f^{-1}(a)| can be
computed in O(1) time.

Finally, we now give what is, to the best of our knowledge, the first result 
about storing a compressed collection of disjoint sets. The key point in the next 
theorem is that, as the sets in the collection C are merged, our space bound 
shrinks with the entropy of the distribution sets(C) of elements to sets. 

5 Otherwise we proceed as usual to map the domain to the effective one. 
Theorem 12. Let C be a collection of disjoint sets whose union is [1..n]. For
any constant ε > 0, we can store C in (1 + ε)nH(sets(C)) + O(|C| lg n) +
o(n)(H(sets(C)) + 1) bits and perform any sequence of m union and find
operations in a total of O((1/ε)(m + n) lg lg n) time.

Proof. Due to space constraints, we leave the proof of this theorem to the
appendix. □

6 Future work 

In the full paper we will discuss other directions for research. First, we can
reduce the dependence on the alphabet size from O(σ lg lg n) to O(σ) by storing
a length-restricted Shannon code in O(σ) bits [9], replacing the data structure
M. To avoid the O(1) extra redundancy per character associated with using
a length-restricted prefix code, we replace each character in s whose codeword
length is at most lg lg n by a distinct number in t. This increases the alphabet
size of t by at most lg n; calculation shows our space bound increases by an
O(1 + 1/lg lg n)-factor and, thus, remains at most nH_0(s) + o(n)(H_0(s) + 1).
Second, given any constant c, we can reduce the min(σ, n/occ(a, s)) in our time
bounds by a factor of (lg n)^c; to do this, we further partition each sub-alphabet
into (lg n)^c sub-sub-alphabets. Third, our alphabet partitioning technique yields
a compressed representation of posting lists of sizes (n_1, ..., n_σ) which supports
access, rank and select on the rows in time O(lg lg σ), and uses total space for
data and index proportional to the entropy H(n_1, ..., n_σ) of the distribution
of those sizes (if the posting lists refer to the words of a text, this is also the
zero-order word-based entropy of the text). This is achieved by encoding the
string of labels encountered during a row-first traversal, writing a special symbol
(e.g., $) at each change of row. This improves the space of previously known
data structures [2], and improves the time complexity of previous compression
results [5]. A more far-fetched challenge is to obtain an indexed data structure
using space closer to nH_k(s) + o(n(1 + H_k(s))) bits than nH_k(s) + o(n lg σ), while
still supporting the operators in reasonable time.

Acknowledgments. Many thanks to Djamal Belazzougui for helpful comments 
on a draft of this paper. 
A Proof of Theorems 5 and 12

Proof (of Theorem 5). We first replace all the elements of the rth run by r, for
1 ≤ r ≤ p, considering the runs in order by minimum element. Let
s ∈ {1, ..., p}^n be the resulting string. We store s using Theorem 1; we also
store an array containing the runs' lengths, directions (incrementing or
decrementing), and minima, in order by minimum element; and we store a
predecessor data structure containing the runs' minima as keys with their
positions in the array as auxiliary information. The predecessor data structure,
which is based on the data structure in Lemma 4 of Andersson's survey paper [1]
and which we will describe in the full version of this paper, takes O(p n^ε) bits
and answers queries in O(1/ε) time. With the array and the predecessor data
structure, we can retrieve a run's data given either its array index or any of
its elements.

If π(i) is the jth element in an incrementing run whose minimum element
is m, then π(i) = m + j − 1; on the other hand, if π(i) is the jth element of a
decrementing run of length l whose minimum element is m, then π(i) = m + l − j.
It follows that, given i, we can compute π(i) by using the query
j = s.rank_{s[i]}(i) and then an array lookup at position s[i] to find m, l and
the direction, finally computing π(i) from them. Also, given π(i), we can compute
i by first using a predecessor query to find the run's array position r, then an
array lookup to find m, l and the direction, then computing j = π(i) − m + 1
(incrementing) or j = m + l − π(i) (decrementing), and finally using the query
i = s.select_r(j). □

Proof (of Theorem 12). We first choose an arbitrary order for the sets and use
Theorem 1 to store the string s[1..n] in which each s[i] is the identifier of the
set containing i. We then choose an arbitrary representative for each set and
store the representatives in both an array and a standard disjoint-set data
structure [24]. Together, our data structures take
nH(sets(C)) + O(|C| lg n) + o(n)(H(sets(C)) + 1) bits. We can perform a query
find(i) on C by performing find(s[i]) on the representatives, and perform a
union(i, j) operation on C by just performing the corresponding operation on the
representatives.

In order for our data structure to shrink as we union the sets, we keep
track of H(sets(C)) and, whenever it shrinks by a factor of 1 + ε, we rebuild
our entire data structure; we will show in the full version of this paper that
this takes O(n) time. Since H(sets(C)) is always less than lg n, we need to
rebuild only O(log_{1+ε} lg n) = O((1/ε) lg lg n) times. Notice we can stop
rebuilding once nH(sets(C)) = o(n), as the space bound then becomes dominated by
the o(n)(H(sets(C)) + 1) term. □
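
The rebuild policy can be illustrated with a simplified sketch. It assumes 0-based elements, keeps the string s as a plain list instead of the structure of Theorem 1, and runs a textbook union-find over set identifiers; the rebuild here is a straightforward relabelling and is not claimed to match the O(n)-time procedure deferred to the full version. All names are hypothetical.

```python
from math import log2


class CompressedDisjointSets:
    def __init__(self, n, eps):
        self.n, self.eps = n, eps
        self.s = list(range(n))      # s[i]: identifier of the set holding element i
        self.rebuilds = 0
        self._reset([1] * n)

    def _reset(self, sizes):
        self.parent = list(range(len(sizes)))   # union-find over identifiers
        self.size = sizes
        # nH = n * H(sets(C)) = sum over sets of |set| * lg(n / |set|).
        self.nH = sum(x * log2(self.n / x) for x in sizes)
        self.nH_built = self.nH

    def _root(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def find(self, i):
        return self._root(self.s[i])

    def union(self, i, j):
        a, b = self.find(i), self.find(j)
        if a == b:
            return
        if self.size[a] < self.size[b]:
            a, b = b, a
        x, y = self.size[a], self.size[b]
        self.nH += ((x + y) * log2(self.n / (x + y))
                    - x * log2(self.n / x) - y * log2(self.n / y))
        self.parent[b] = a
        self.size[a] = x + y
        # Rebuild once the entropy has shrunk by a factor of 1 + eps.
        if self.nH * (1 + self.eps) <= self.nH_built:
            self._rebuild()

    def _rebuild(self):
        roots = sorted({self._root(x) for x in self.s})
        new_id = {r: k for k, r in enumerate(roots)}
        sizes = [self.size[r] for r in roots]
        self.s = [new_id[self._root(x)] for x in self.s]
        self._reset(sizes)
        self.rebuilds += 1
```

Identifiers returned by find may change across rebuilds, so callers should compare find(i) with find(j) rather than store the returned values.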



